Poster: Meridian01 Date: Dec 6, 2008 9:17pm
Forum: texts Subject: Duplicate files/books?

I'm wondering what the best method for searching out duplicates is. I would like to submit some PDFs I made of old, out-of-print Chinese books on Tai Chi. I'm not sure how to search for items that have no ISBN and no title in English.

I'm also wondering about duplication between the different organisations on the web that are digitally transcribing books. I know that Google seems to be a contributor and sponsor here, but what about Project Gutenberg and some of the other earlier pioneers? Are those other collections absorbed into IA's servers, or are they merely linked? I looked through the FAQ and tried searching for information about this, but to no avail.

Poster: Horatius Date: Dec 7, 2008 6:57am
Forum: texts Subject: Re: Duplicate files/books?

Hello! I've just joined. Perhaps I'm duplicating a question :) If so, sorry. Some weeks ago I downloaded this text from Google Books: http://www.archive.org/details/sullestradeferr00blasgoog.
I then added the text to Italian Wikisource. It has been carefully cleaned of OCR mistakes, and I think it is now rather better than the one in IA. So why doesn't IA download it from Wikisource and replace its copy with the better-quality one? Are there copyright / GFDL license problems?

Poster: stbalbach Date: Dec 7, 2008 8:31am
Forum: texts Subject: Re: Duplicate files/books?

Well, I think the idea with Internet Archive is that when users such as yourself create something new (such as a corrected OCR text), you can upload it to Internet Archive. Just upload the new text and it will be here. It won't show up on the original book's page, because it is a different work, but you could post in the review/comment section of that book to let users know that another version (the one you uploaded) exists on Internet Archive.

Another option is to send your text to Project Gutenberg, since that is what they do: correct and edit OCR and publish it as text. Project Gutenberg texts then get uploaded to Internet Archive. That's another way.

Poster: Horatius Date: Dec 8, 2008 12:23am
Forum: texts Subject: Re: Duplicate files/books?

Thanks for answering. I have a different concept of "quality": I think it is useless to maintain a file of very bad quality when a good one is available. Of course, the good one can (or must) be improved further. Another problem (perhaps) comes from the GFDL license used by Wikisource. It also allows commercial use, and I ask myself whether it's compatible with the copyright policies of IA & Co. Bye!

Poster: girl2k (Administrator, Curator, or Staff) Date: Dec 8, 2008 2:18pm
Forum: texts Subject: Re: Duplicate files/books?

You raise an important point, but, unfortunately, there is no way for the Archive to incorporate corrected OCR into its book files. It is a problem we will solve in the future, but currently it cannot be done. The advice provided above is the best option for now.

Poster: stbalbach Date: Dec 7, 2008 6:03am
Forum: texts Subject: Re: Duplicate files/books?

Can you give an example of a book you're trying to search for duplicates of?

Gutenberg and Google books are uploaded here. There appear to be over half a million Google scans, and I suppose all of Gutenberg (I have not checked, but Gutenberg only has around 25k books or so).

Although there are a lot of "duplicates", keep in mind that each book scan is unique, since there are things like marginalia (notes in margins), different editions, differences in book condition, scan quality, etc. Personally, when searching for a particular book, I appreciate having many scans to choose from.